Abstract

Stochastic gradient descent (SGD) and its variants have become increasingly
popular in machine learning due to their efficiency and effectiveness. To handle
large-scale problems, researchers have recently proposed several parallel SGD
methods for multicore systems. However, existing parallel SGD methods cannot
achieve satisfactory performance in real applications. In this paper, we propose a
fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous
strategy to parallelize the recently proposed SGD variant called stochastic
variance reduced gradient (SVRG). Both theoretical and empirical results show
that AsySVRG can outperform existing state-of-the-art parallel SGD methods like
Hogwild! in terms of convergence rate and computation cost.


1 Introduction 

Assume we have a set of labeled instances $\{(x_i, y_i) \mid i = 1, \ldots, n\}$, where $x_i \in \mathbb{R}^p$ is
the feature vector for instance $i$, $p$ is the feature size and $y_i \in \{1, -1\}$ is the class label
of $x_i$. In machine learning, we often need to solve the following regularized empirical
risk minimization problem:

$$\min_{w} \frac{1}{n} \sum_{i=1}^{n} f_i(w), \tag{1}$$

where $w$ is the parameter to learn and $f_i(w)$ is the loss function defined on instance $i$,
often with a regularization term to avoid overfitting. For example, $f_i(w)$ can be
$\log(1 + e^{-y_i x_i^T w})$, which is known as the logistic loss, or $\max\{0, 1 - y_i x_i^T w\}$, which
is known as the hinge loss in support vector machine (SVM). The regularization term
can be $\frac{\lambda}{2}\|w\|_2^2$, $\lambda\|w\|_1$ or some other forms.
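As a concrete illustration of problem (1), the following minimal sketch evaluates the two example losses and the regularized objective with an L2 penalty. The function names, the plain-list data layout, and the choice of putting the regularizer outside the per-instance loss are assumptions made only for this illustration.

```python
import math

def dot(x, w):
    """Inner product x^T w for two equal-length lists."""
    return sum(xi * wi for xi, wi in zip(x, w))

def logistic_loss(w, x, y):
    """log(1 + exp(-y * x^T w)), evaluated without overflow."""
    z = -y * dot(x, w)
    # log(1 + e^z) = max(z, 0) + log(1 + e^{-|z|})
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def hinge_loss(w, x, y):
    """max{0, 1 - y * x^T w}."""
    return max(0.0, 1.0 - y * dot(x, w))

def objective(w, X, Y, lam, loss=logistic_loss):
    """(1/n) * sum_i loss_i(w) + (lam/2) * ||w||^2."""
    n = len(X)
    data_term = sum(loss(w, x, y) for x, y in zip(X, Y)) / n
    return data_term + 0.5 * lam * dot(w, w)
```

At $w = 0$ every logistic loss equals $\log 2$ and every hinge loss equals 1, which gives a quick sanity check for any implementation.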

Due to their efficiency and effectiveness, stochastic gradient descent (SGD) and its
variants have recently attracted much attention for solving machine
learning problems like that in (1). Many works have shown that SGD and its variants
can outperform traditional batch learning algorithms such as gradient descent or
Newton methods in real applications.

In many real-world problems, the number of instances $n$ is typically very large. In
this case, the traditional sequential SGD methods might not be efficient enough to find
the optimal solution for (1). On the other hand, clusters and multicore systems have
become popular in recent years. Hence, to handle large-scale problems, researchers
have recently proposed several distributed SGD methods for clusters and parallel SGD
methods for multicore systems. Although distributed SGD methods for clusters
are meaningful for handling very large-scale problems, there also
exist a lot of problems which can be solved by a single machine with multiple cores.
Furthermore, even in distributed settings with clusters, each machine (node) of the
cluster typically has multiple cores. Hence, how to design effective parallel SGD
methods for multicore systems has become a key issue in solving large-scale learning
problems like that in (1).

Some parallel SGD methods for multicore systems have already appeared. The
round-robin scheme orders the processors, and each processor then
updates the variables in turn. Hogwild! [8] is a lock-free approach for parallel SGD.
Experimental results in [8] have shown that Hogwild! can outperform the round-robin
scheme. However, Hogwild! can only achieve a sub-linear convergence rate.
Hence, Hogwild! is not efficient (fast) enough to achieve satisfactory performance.

In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG,
by designing an asynchronous strategy to parallelize the recently proposed SGD variant
called stochastic variance reduced gradient (SVRG). The contributions of
AsySVRG can be outlined as follows:

• Two asynchronous schemes, consistent reading and inconsistent reading, are proposed
to coordinate different threads. Theoretical analysis is provided to show
that both schemes have a linear convergence rate, which is faster than that of
Hogwild!.

• The implementation of AsySVRG is simple.

• Empirical results on real datasets show that AsySVRG can outperform Hogwild!
in terms of computation cost.


2 Preliminary 

We use $f(w)$ to denote the objective function in (1), which means $f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$.
In this paper, we use $\|\cdot\|$ to denote the $L_2$-norm $\|\cdot\|_2$ and $w_*$ to denote the optimal
solution of the objective function.

Assumption 1. The function $f_i(\cdot)$ ($i = 1, \ldots, n$) in (1) is convex and $L$-smooth, which
means $\exists L > 0$, $\forall a, b$,

$$f_i(a) \le f_i(b) + \nabla f_i(b)^T (a - b) + \frac{L}{2} \|a - b\|^2,$$

or equivalently

$$\|\nabla f_i(a) - \nabla f_i(b)\| \le L \|a - b\|,$$

where $\nabla f_i(\cdot)$ denotes the gradient of $f_i(\cdot)$.

Assumption 2. The objective function $f(\cdot)$ is $\mu$-strongly convex, which means $\exists \mu > 0$,
$\forall a, b$,

$$f(a) \ge f(b) + \nabla f(b)^T (a - b) + \frac{\mu}{2} \|a - b\|^2,$$

or equivalently

$$\|\nabla f(a) - \nabla f(b)\| \ge \mu \|a - b\|.$$
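As a worked example of the two assumptions, consider the L2-regularized logistic loss that is used later in the experiments, with the regularizer folded into each $f_i$ (this folding is an assumption of the example, as is the symbol $\sigma$ for the sigmoid). The Hessian bounds below give admissible values of $L$ and $\mu$; they are standard bounds, not constants stated by the paper.

```latex
% f_i(w) = \log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2, \quad y_i \in \{1,-1\}
% With \sigma(z) = 1/(1+e^{-z}) and z_i = y_i x_i^T w:
\nabla^2 f_i(w) = \sigma(z_i)\bigl(1 - \sigma(z_i)\bigr)\, x_i x_i^T + \lambda I .
% Since 0 < \sigma(z)(1-\sigma(z)) \le \tfrac14 and x_i x_i^T \succeq 0:
\lambda I \;\preceq\; \nabla^2 f_i(w) \;\preceq\; \Bigl(\tfrac14 \|x_i\|^2 + \lambda\Bigr) I .
% Hence Assumption 1 holds with L = \max_i \tfrac14\|x_i\|^2 + \lambda,
% and Assumption 2 holds for f = \tfrac1n \sum_i f_i with \mu = \lambda.
```

For instance, with unit-norm feature vectors and $\lambda = 10^{-4}$ this gives $L = 0.2501$ and $\mu = 10^{-4}$, so the condition number $L/\mu$ is about 2500.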


3 Algorithm 

Assume that we have $p$ processors (threads) which can access a shared memory, and $w$
is stored in the shared memory. Furthermore, we assume each thread has access to a
shared data structure for the vector $w$ and can choose any instance randomly
to compute the gradient $\nabla f_i(w)$. We also assume consistent reading of $w$, which
means that all the elements of $w$ in the shared memory have the same "age" (time
clock).

Our AsySVRG algorithm is presented in Algorithm 1. We can find that in the $t$-th
iteration, each thread completes the following operations:

• All threads parallelly compute the full gradient $\nabla f(w_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w_t)$.
Assume the gradients computed by thread $a$ are denoted by $\phi_a$, which is a subset
of $\{\nabla f_i(w_t) \mid i = 1, \ldots, n\}$. We have $\phi_a \cap \phi_b = \emptyset$ if $a \ne b$, and
$\bigcup_{a=1}^{p} \phi_a = \{\nabla f_i(w_t) \mid i = 1, \ldots, n\}$.

• Run an inner loop in which each iteration randomly chooses an instance indexed
by $i_m$ and computes the gradient $\nabla f_{i_m}(u_{k(m)})$, where $u_0 = w_t$, and computes
the vector

$$v_m = \nabla f_{i_m}(u_{k(m)}) - \nabla f_{i_m}(u_0) + \nabla f(u_0). \tag{2}$$

Then update the vector

$$u_{m+1} = u_m - \eta v_m,$$

where $\eta > 0$ is a step size.
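The first step above, in which the threads share the work of the full gradient, can be sketched as follows. This is a minimal illustration under stated assumptions: the per-instance loss is the logistic loss, thread $a$ takes the indices $a, a + p, a + 2p, \ldots$ (one of many valid disjoint partitions), and `grad_logistic`, `partial_sum`, and `full_gradient` are hypothetical helper names. In CPython the threads will not run the arithmetic truly in parallel, so the sketch shows the partitioning logic rather than the speedup.

```python
from concurrent.futures import ThreadPoolExecutor
import math

def grad_logistic(w, x, y):
    """Gradient of log(1 + exp(-y * x^T w)) with respect to w."""
    z = y * sum(xi * wi for xi, wi in zip(x, w))
    # s = 1 / (1 + exp(z)), evaluated without overflow
    if z >= 0:
        e = math.exp(-z)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(z))
    return [-y * s * xi for xi in x]

def partial_sum(w, X, Y, indices):
    """Sum of per-instance gradients over one thread's index set."""
    total = [0.0] * len(w)
    for i in indices:
        g = grad_logistic(w, X[i], Y[i])
        total = [t + gi for t, gi in zip(total, g)]
    return total

def full_gradient(w, X, Y, num_threads):
    """(1/n) * sum_i grad f_i(w), with the sum split across threads."""
    n = len(X)
    # Disjoint index sets whose union is {0, ..., n-1}.
    parts = [range(a, n, num_threads) for a in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        sums = list(pool.map(lambda idx: partial_sum(w, X, Y, idx), parts))
    return [sum(column) / n for column in zip(*sums)]
```

Because the index sets are disjoint and cover every instance, the result does not depend on the number of threads, up to floating-point summation order.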

Here, $m$ is the total number of updates on $w$ from all threads and $k(m)$ is the
$u$-iteration at which the update was calculated. Since each thread can compute
an update and change $w$, obviously $k(m) \le m$. At the same time, we should
guarantee that the update is not too old. Hence, we need $m - k(m) \le \tau$, where $\tau$
is a non-negative integer, usually called the bounded delay. If $\tau = 0$, the algorithm
AsySVRG degenerates to the sequential (single-thread) version of SVRG.

Algorithm 1 AsySVRG

Initialization: $p$ threads, initialize $w_0$, $\eta$;
for $t = 0, 1, 2, \ldots$ do
    All threads parallelly compute the full gradient $\nabla f(w_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w_t)$;
    $u_0 = w_t$;
    For each thread, do:
    for $m = 0$ to $M$ do
        Pick up an $i_m$ randomly from $\{1, \ldots, n\}$;
        Compute update vector $v_m = \nabla f_{i_m}(u_{k(m)}) - \nabla f_{i_m}(u_0) + \nabla f(u_0)$;
        $u_{m+1} \leftarrow u_m - \eta v_m$;
    end for
    Option 1: Take $w_{t+1}$ to be the current $u$ in the shared memory;
    Option 2: Take $w_{t+1}$ to be the average of the $u_m$ generated by the inner loop;
end for
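For reference, the sequential special case of Algorithm 1 (bounded delay equal to zero, a single thread) can be sketched in a few lines. The sketch assumes the L2-regularized logistic loss with the regularizer folded into each $f_i$, uses Option 2 (averaging the inner iterates), and the names `grad_i`, `objective`, and `svrg` are hypothetical. It is an illustration of the update rule, not the authors' implementation.

```python
import math
import random

def grad_i(w, x, y, lam):
    """Gradient of log(1 + exp(-y x^T w)) + (lam/2)||w||^2 at w."""
    z = y * sum(a * b for a, b in zip(x, w))
    # s = 1 / (1 + exp(z)), evaluated without overflow
    s = 1.0 / (1.0 + math.exp(z)) if z < 0 else math.exp(-z) / (1.0 + math.exp(-z))
    return [-y * s * xj + lam * wj for xj, wj in zip(x, w)]

def objective(w, X, Y, lam):
    """(1/n) sum_i log(1 + exp(-y_i x_i^T w)) + (lam/2)||w||^2."""
    total = 0.0
    for x, y in zip(X, Y):
        z = -y * sum(a * b for a, b in zip(x, w))
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(X) + 0.5 * lam * sum(wj * wj for wj in w)

def svrg(X, Y, lam, eta, outer_iters, inner_iters, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(outer_iters):
        # Full gradient at the snapshot u_0 = w_t.
        full = [0.0] * d
        for x, y in zip(X, Y):
            for j, g in enumerate(grad_i(w, x, y, lam)):
                full[j] += g / n
        u0 = list(w)
        u = list(w)
        avg = [0.0] * d
        for _ in range(inner_iters):
            i = rng.randrange(n)
            g_u = grad_i(u, X[i], Y[i], lam)
            g_0 = grad_i(u0, X[i], Y[i], lam)
            # Variance-reduced direction v_m from equation (2), with k(m) = m.
            v = [a - b + c for a, b, c in zip(g_u, g_0, full)]
            u = [uj - eta * vj for uj, vj in zip(u, v)]
            avg = [aj + uj / inner_iters for aj, uj in zip(avg, u)]
        w = avg  # Option 2: average of the inner iterates
    return w
```

On a small separable toy problem the objective drops well below its value $\log 2$ at $w = 0$ after a handful of outer iterations.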


4 Convergence Analysis 

Our convergence analysis is based on Option 2 in Algorithm 1. Please note that
we have $p$ threads and let each thread calculate $M$ updates. Hence, the total
number of updates on $w$ in the shared memory, which is denoted by $\tilde{M}$, must satisfy
$0 < \tilde{M} \le pM$. Obviously, the larger $M$ is, the larger $\tilde{M}$ will be.

4.1 Consistent Reading 

Since $w$ is a vector with several elements, it is typically impossible to complete
updating $u_{m+1}$ in an atomic operation. We have to give this step a lock for each thread.
More specifically, a lock is needed whenever a thread tries to read $u$ or update $u$ in the
shared memory. This is called the consistent reading scheme.
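The consistent reading scheme can be sketched with Python threads and a single lock that guards both the read and the update of the shared vector. This is a minimal sketch under assumptions: the loss is L2-regularized logistic loss, the full gradient is computed serially for brevity, Option 1 is used for the output, and all function names are hypothetical. The gradient computation itself happens outside the lock, which is where the delay between reading $u$ and applying the update comes from.

```python
import math
import random
import threading

def grad_i(w, x, y, lam):
    """Gradient of log(1 + exp(-y x^T w)) + (lam/2)||w||^2 at w."""
    z = y * sum(a * b for a, b in zip(x, w))
    s = 1.0 / (1.0 + math.exp(z)) if z < 0 else math.exp(-z) / (1.0 + math.exp(-z))
    return [-y * s * xj + lam * wj for xj, wj in zip(x, w)]

def objective(w, X, Y, lam):
    total = 0.0
    for x, y in zip(X, Y):
        z = -y * sum(a * b for a, b in zip(x, w))
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total / len(X) + 0.5 * lam * sum(wj * wj for wj in w)

def asysvrg_outer_iteration(w, X, Y, lam, eta, num_threads, M, seed=0):
    n, d = len(X), len(w)
    # Full gradient at the snapshot (serial here; the paper splits it across threads).
    full = [0.0] * d
    for x, y in zip(X, Y):
        for j, g in enumerate(grad_i(w, x, y, lam)):
            full[j] += g / n
    u0 = list(w)
    u = list(w)                  # shared vector, updated in place
    lock = threading.Lock()

    def worker(tid):
        rng = random.Random(seed + tid)
        for _ in range(M):
            i = rng.randrange(n)
            with lock:           # locked read: every coordinate has the same age
                u_read = list(u)
            g_u = grad_i(u_read, X[i], Y[i], lam)
            g_0 = grad_i(u0, X[i], Y[i], lam)
            v = [a - b + c for a, b, c in zip(g_u, g_0, full)]
            with lock:           # locked update of the shared vector
                for j in range(d):
                    u[j] -= eta * v[j]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return u                     # Option 1: current u in the shared memory
```

Dropping the first `with lock:` block (the read) gives a sketch of the inconsistent reading scheme, and dropping both gives the lock-free variant evaluated in the experiments.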

First, we give some notations as follows:

$$p_{m,i} = \nabla f_i(u_m) - \nabla f_i(u_0) + \nabla f(u_0), \tag{3}$$

$$q_m = \frac{1}{n} \sum_{i=1}^{n} \|p_{m,i}\|^2. \tag{4}$$

It is easy to find that $v_m = p_{k(m), i_m}$, and the update of $u$ can be written as follows:

$$u_{m+1} = u_m - \eta v_m. \tag{5}$$

One key to getting the convergence rate is the estimation of the variance of $v_m$. We use
a technique from earlier work on asynchronous methods and get the following result:

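Lemma 1. There exists a constant $\rho > 1$ such that $E q_m \le \rho E q_{m+1}$.

Proof.

$$\begin{aligned}
\|p_{m,i}\|^2 - \|p_{m+1,i}\|^2 &\le 2 p_{m,i}^T (p_{m,i} - p_{m+1,i}) \\
&\le \frac{1}{r} \|p_{m,i}\|^2 + r \|p_{m,i} - p_{m+1,i}\|^2 \\
&= \frac{1}{r} \|p_{m,i}\|^2 + r \|\nabla f_i(u_m) - \nabla f_i(u_{m+1})\|^2 \\
&\le \frac{1}{r} \|p_{m,i}\|^2 + r L^2 \|u_m - u_{m+1}\|^2 \\
&= \frac{1}{r} \|p_{m,i}\|^2 + r \eta^2 L^2 \|v_m\|^2.
\end{aligned} \tag{6}$$

The fourth relation uses Assumption 1, and $r > 0$ is a constant. Summing (6) from
$i = 1$ to $n$, and taking expectation about $i_m$, we have

$$E q_m - E q_{m+1} \le \frac{1}{r} E q_m + r \eta^2 L^2 E q_{k(m)}. \tag{7}$$

We use $c = 2 \max\{\frac{1}{r}, r \eta^2 L^2\}$ and choose $r$, $\eta$ such that $0 < c < 1$; then we can get
$q_0 \le \frac{1}{1 - c} E q_1$. Please note that $k(0) = 0$. Then, we obtain that

$$E q_m \le \rho E q_{m+1}, \tag{8}$$

where $\rho$ satisfies $\frac{1}{1 - c} < \rho$ and $\rho \left(1 - \frac{c}{2}(1 + \rho^{\tau})\right) > 1$.

□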


According to Lemma 1, $\rho > 1$. If we want $\rho$ to be small enough, we need a small
step size $\eta$. This is reasonable because $u$ should be changed slowly if the gradient
applied to update $u$ is relatively old.


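Theorem 1. With Assumptions 1 and 2, choosing a small step size $\eta$ and a large $\tilde{M}$,
we have the following result:

$$E(f(w_{t+1}) - f(w_*)) \le \left( \frac{1}{\mu \tilde{M} \eta \left(1 - 2(\tau + 1)\rho^{2\tau} \eta L\right)} + \frac{2(\tau + 1)\rho^{2\tau} \eta L}{1 - 2(\tau + 1)\rho^{2\tau} \eta L} \right) E(f(w_t) - f(w_*)).$$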
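To see how the step size and the number of inner updates interact, the contraction factor obtained at the end of the appendix proof of Theorem 1 can be evaluated numerically. Writing $s = 2(\tau + 1)\rho^{2\tau}\eta L$, the factor is $\frac{1}{\mu \tilde{M} \eta (1 - s)} + \frac{s}{1 - s}$. The sketch below is only an evaluation of that expression; the function name and the parameter values in the usage note are illustrative assumptions and are not checked against the condition on $\rho$ in Lemma 1.

```python
def theorem1_factor(mu, L, eta, M_tilde, tau, rho):
    """Contraction factor of Theorem 1 for given constants.

    s = 2 (tau + 1) rho^(2 tau) eta L must be below 1 for the bound to hold.
    """
    s = 2.0 * (tau + 1) * rho ** (2 * tau) * eta * L
    if s >= 1.0:
        raise ValueError("step size too large: need 2(tau+1) rho^(2 tau) eta L < 1")
    return 1.0 / (mu * M_tilde * eta * (1.0 - s)) + s / (1.0 - s)
```

For example, with the hypothetical values $\mu = 0.1$, $L = 1$, $\eta = 0.01$, $\tilde{M} = 10^5$, $\tau = 2$ and $\rho = 1.1$, the factor is about 0.107, so each outer iteration shrinks the expected suboptimality by roughly a factor of nine. The second term does not depend on $\tilde{M}$, which is why a small $\eta$ is needed in addition to a large $\tilde{M}$.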



4.2 Inconsistent Reading 

The consistent reading scheme costs much waiting time because a lock is needed
whenever a thread tries to read $u$ or update $u$. In this subsection, we introduce
an inconsistent reading scheme, in which a thread does not need a lock when reading
the current $u$ in the shared memory. For the update step, the thread still needs a lock. Please
note that our inconsistent reading scheme is different from schemes that adopt
the atomic update strategy. Since the update vector applied to $u$ is usually dense, the
atomic update strategy is not applicable in our case.

For convenience, we use $U = \{u_0, u_1, \ldots\}$ to denote the vector set generated in
the inner loop of our algorithm, and $\hat{u}_m$ to denote the vector that one thread gets from
the shared memory and uses to update $u_m$. Then, we have

$$\begin{aligned}
v_m &= \nabla f_{i_m}(\hat{u}_m) - \nabla f_{i_m}(u_0) + \nabla f(u_0), \\
u_{m+1} &= u_m - \eta v_m.
\end{aligned} \tag{9}$$


We also need the following assumption: 

Assumption 3. All threads have the same speed of reading operations and the
same speed of updating operations. Moreover, any reading operation is faster than an updating
operation, which means that for three scalars $a$, $b$, $c$, "$b = a$" is faster than "$a = a + c$".

Since we do not use locks when a thread reads $u$ in the shared memory, some
elements in $u$ which have not yet been read by one thread may be changed by other threads.
Usually, $\hat{u}_m \notin U$. If we call the age of each element of $u_m \in U$ to be $m$, the ages
of the elements of $\hat{u}_m$ may not be the same. We use $a(m)$ to denote the smallest age
of the elements of $\hat{u}_m$. Of course, we expect that $a(m)$ is not too small. Given a
positive integer $\tau$, we assume that $m - a(m) \le \tau$. With Assumption 3, according to
the definition of $\hat{u}_m$ and $a(m)$, we have

$$\hat{u}_m = P_{g_{m,1}} u_{a(m)} + P_{g_{m,2}} u_{a(m)+1}, \tag{10}$$

where $g_{m,i}$ is a subset of $\{1, 2, \ldots, p\}$ ($i = 1, 2$), and $P_{g_{m,i}} \in \mathbb{R}^{p \times p}$ is a diagonal
matrix whose diagonal entry $(k, k)$ is 1 for all $k \in g_{m,i}$, with all other elements of $P_{g_{m,i}}$
equal to 0.


Equation (10) holds because, with an update lock and Assumption 3, at most one thread
is updating $u$ at any time. If a thread begins to read $u$, only two cases can happen.
One is that no thread is updating $u$, which leads to $P_{g_{m,2}} = 0$. The other is that one
thread is updating $u$, in which case the reading thread gets some new elements of $u$ and
may also get some old elements. Obviously, all elements it reads have the same age if it
reads at a good pace.

Then, we can get the following results: 

• $a(m) \le a(m+1)$ or $a(m+1) \le a(m)$.

• $g_{m,1} \cap g_{m,2} = \emptyset$ and $g_{m,1} \cup g_{m,2} = \{1, 2, \ldots, p\}$, which means that
$P_{g_{m,1}} + P_{g_{m,2}} = I_p$, where $I_p$ is the $p \times p$ identity matrix.
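The structure of an inconsistent read can be made concrete with a tiny deterministic example: the vector a thread sees mixes coordinates from two consecutive versions of $u$, and the two index sets partition the coordinates. The sketch below is an illustration only; which coordinates end up "new" depends on thread timing, so the set passed as `new_positions` is a hypothetical choice, and the helper names are not from the paper.

```python
def mixed_read(u_old, u_new, new_positions):
    """Return (u_hat, g1, g2) where u_hat takes u_new on g2 and u_old on g1."""
    p = len(u_old)
    g2 = set(new_positions)          # coordinates already overwritten by the writer
    g1 = set(range(p)) - g2          # coordinates still holding the old value
    u_hat = [u_new[k] if k in g2 else u_old[k] for k in range(p)]
    return u_hat, g1, g2

def selector(index_set, p):
    """Diagonal 0/1 matrix with ones exactly at the positions in index_set."""
    return [[1 if (row == col and row in index_set) else 0 for col in range(p)]
            for row in range(p)]
```

With `u_old = [1, 2, 3, 4]`, `u_new = [10, 20, 30, 40]` and `new_positions = [0, 1]`, the read returns `[10, 20, 3, 4]`, and adding the two selector matrices entry by entry gives the identity, matching the second bullet above.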

Similar to (3) and (4), we give the following definitions:

$$p_i(x) = \nabla f_i(x) - \nabla f_i(u_0) + \nabla f(u_0), \qquad q(x) = \frac{1}{n} \sum_{i=1}^{n} \|p_i(x)\|^2.$$

We can find that $v_m = p_{i_m}(\hat{u}_m)$ and $E\|v_m\|^2 = E(E_{i_m}\|v_m\|^2) = E q(\hat{u}_m)$.

For any two integers $m$, $n$, we use the convention $\sum_{l=m}^{n} = \sum_{l=n}^{m} = \sum_{l=\min\{m,n\}}^{\max\{m,n\}}$.

According to the proof of Lemma 1, we can get the property that $\forall x, y$ and $r > 0$,

$$\|p_i(x)\|^2 - \|p_i(y)\|^2 \le \frac{1}{r} \|p_i(x)\|^2 + r L^2 \|x - y\|^2. \tag{11}$$

Lemma 2. There exist a constant $\rho > 1$ and a corresponding suitable step size $\eta$ such
that

$$E q(\hat{u}_m) \le \rho E q(\hat{u}_{m+1}).$$


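Proof. According to (11), we have

$$\|p_i(\hat{u}_m)\|^2 - \|p_i(\hat{u}_{m+1})\|^2 \le \frac{1}{r} \|p_i(\hat{u}_m)\|^2 + r L^2 \|\hat{u}_m - \hat{u}_{m+1}\|^2. \tag{12}$$

According to (10), we have

$$\begin{aligned}
\|\hat{u}_m - \hat{u}_{m+1}\|^2 &= \left\| P_{g_{m,1}} u_{a(m)} + P_{g_{m,2}} u_{a(m)+1} - P_{g_{m+1,1}} u_{a(m+1)} - P_{g_{m+1,2}} u_{a(m+1)+1} \right\|^2 \\
&\le 2 \sum_{l=a(m)}^{a(m+1)-1} \|u_l - u_{l+1}\|^2 + 2\eta^2 \|v_{a(m)}\|^2 + 2\eta^2 \|v_{a(m+1)}\|^2 \\
&\le 4\eta^2 \sum_{l=a(m)}^{a(m+1)} \|v_l\|^2.
\end{aligned}$$

In the first inequality, $a(m+1)$ may be less than $a(m)$, but this does not affect the result
of the second inequality. Summing (12) from $i = 1$ to $n$, we can get

$$q(\hat{u}_m) - q(\hat{u}_{m+1}) \le \frac{1}{r} q(\hat{u}_m) + 4 r \eta^2 L^2 \sum_{l=a(m)}^{a(m+1)} \|v_l\|^2.$$

Taking expectation with respect to $i_{a(m)}, i_{a(m)+1}, \ldots, i_{a(m+1)}$, which are the random numbers
selected by Algorithm 1, we obtain

$$E q(\hat{u}_m) - E q(\hat{u}_{m+1}) \le \frac{1}{r} E q(\hat{u}_m) + 4 r \eta^2 L^2 \sum_{l=a(m)}^{a(m+1)} E q(\hat{u}_l).$$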


When p, r, rj satisfy the following condition: 


1 + Arrj^L 

1 _ i _ ~ ^ 

p{l — - — Arp'^LF'{T -f 1)^"^) > 1 + 
r 
we have

$$E q(\hat{u}_m) \le \rho E q(\hat{u}_{m+1}).$$

□


Lemma 3. For the relation between $q(\hat{u}_m)$ and $q(u_m)$, we have the following result:

$$E q(\hat{u}_m) \le C_1 E q(u_m),$$

where $C_1 = \dfrac{1}{1 - \frac{1}{r} - 4 r \tau \rho^{\tau} \eta^2 L^2} > 1$.


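Proof. In (11), if we take $x = \hat{u}_m$ and $y = u_m$, we can obtain

$$\|p_i(\hat{u}_m)\|^2 - \|p_i(u_m)\|^2 \le \frac{1}{r} \|p_i(\hat{u}_m)\|^2 + r L^2 \|\hat{u}_m - u_m\|^2.$$

Summing from $i = 1$ to $n$, we obtain

$$\begin{aligned}
q(\hat{u}_m) - q(u_m) &\le \frac{1}{r} q(\hat{u}_m) + r L^2 \|\hat{u}_m - u_m\|^2 \\
&= \frac{1}{r} q(\hat{u}_m) + r L^2 \left\| P_{g_{m,1}} u_{a(m)} + P_{g_{m,2}} u_{a(m)+1} - u_m \right\|^2 \\
&\le \frac{1}{r} q(\hat{u}_m) + 4 r L^2 \sum_{l=a(m)}^{m-1} \|u_l - u_{l+1}\|^2 \\
&= \frac{1}{r} q(\hat{u}_m) + 4 r L^2 \eta^2 \sum_{l=a(m)}^{m-1} \|v_l\|^2,
\end{aligned}$$

where the second inequality uses the fact that $P_{g_{m,1}} + P_{g_{m,2}} = I_p$.
Taking expectation on both sides, and using Lemma 2, we get

$$E q(\hat{u}_m) - E q(u_m) \le \left( \frac{1}{r} + 4 r \tau \rho^{\tau} \eta^2 L^2 \right) E\|v_m\|^2,$$

which, since $E\|v_m\|^2 = E q(\hat{u}_m)$, means that

$$E q(\hat{u}_m) \le \frac{1}{1 - \frac{1}{r} - 4 r \tau \rho^{\tau} \eta^2 L^2} E q(u_m). \tag{13}$$

□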


Similar to Theorem 1, we have the following result about inconsistent reading:

Theorem 2. With a suitable step size $\eta$ which satisfies the condition
in Lemma 2, and a large $\tilde{M}$, we can get our convergence result for the inconsistent
reading scheme:

$$E(f(w_{t+1}) - f(w_*)) \le \left( \frac{2}{\mu \tilde{M} (2\eta - c_2)} + \frac{c_2}{2\eta - c_2} \right) E(f(w_t) - f(w_*)),$$

where $c_2 = \dfrac{4 L \eta^2 + 16 \tau \rho^{\tau} L^2 \eta^3}{1 - \frac{1}{r} - 4 r \tau \rho^{\tau} \eta^2 L^2}$.

Remark: In our convergence analysis for both consistent reading and inconsistent
reading schemes, there are a lot of parameters, such as $r$, $\eta$, $\rho$, $\tau$, $\tilde{M}$, $\mu$, $L$. We can set
$r$ to a fixed constant. Since $\mu$, $L$ are determined by the objective function and $\rho$, $\tau$ are constants, we
only need the step size $\eta$ to be small enough and $\tilde{M}$ to be large enough. Then, all the
conditions in these lemmas and theorems will be satisfied.
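The remark can be checked numerically for the inconsistent reading bound. The sketch below evaluates the factor $\frac{2}{\mu \tilde{M}(2\eta - c_2)} + \frac{c_2}{2\eta - c_2}$, assuming that the constant reads $c_2 = \frac{4L\eta^2 + 16\tau\rho^{\tau}L^2\eta^3}{1 - 1/r - 4r\tau\rho^{\tau}\eta^2L^2}$, which is how the appendix derivation appears to define it. The function name and all parameter values are illustrative assumptions, and the values are not checked against the condition on $\rho$ in Lemma 2.

```python
def theorem2_factor(mu, L, eta, M_tilde, tau, rho, r):
    """Contraction factor of Theorem 2 under the assumed form of c_2."""
    denom = 1.0 - 1.0 / r - 4.0 * r * tau * rho ** tau * eta ** 2 * L ** 2
    if denom <= 0.0:
        raise ValueError("need 1 - 1/r - 4 r tau rho^tau eta^2 L^2 > 0")
    c2 = (4.0 * L * eta ** 2 + 16.0 * tau * rho ** tau * L ** 2 * eta ** 3) / denom
    if c2 >= 2.0 * eta:
        raise ValueError("step size too large: need c_2 < 2 eta")
    return 2.0 / (mu * M_tilde * (2.0 * eta - c2)) + c2 / (2.0 * eta - c2)
```

With the hypothetical values $\mu = 0.1$, $L = 1$, $\eta = 0.01$, $\tilde{M} = 10^5$, $\tau = 2$, $\rho = 1.1$ and $r = 10$, the factor is about 0.036. Because $c_2$ scales like $\eta^2$ while $2\eta$ scales like $\eta$, shrinking $\eta$ drives the second term toward zero, and a large $\tilde{M}$ then controls the first term.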


5 Experiments 

We choose logistic regression with an $L_2$-norm regularization term to evaluate our
AsySVRG. Hence, $f(w)$ is defined as follows:

$$f(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i x_i^T w}\right) + \frac{\lambda}{2} \|w\|^2.$$

We choose Hogwild! as the baseline because Hogwild! has been shown to be the
state-of-the-art parallel SGD method for multicore systems. The experiments are
conducted on a server with 12 Intel cores and 64 GB of memory.

5.1 Dataset and Evaluation Metric 

We choose three datasets for evaluation. They are rcv1, real-sim, and news20, which
can be downloaded from the LibSVM website¹. Detailed information is shown in
Table 1, where $\lambda$ is the hyper-parameter in $f(w)$.


Table 1: Dataset

dataset     instances   features     λ
rcv1        20,242      47,236       0.0001
real-sim    72,309      20,958       0.0001
news20      19,996      1,355,191    0.0001


We adopt the speedup and convergence rate for evaluation. The definition of speedup
is as follows:

$$\text{speedup} = \frac{\text{CPU time taken to get a suboptimal solution with one thread}}{\text{CPU time taken to get a suboptimal solution with } p \text{ threads}}.$$

We get a suboptimal solution by stopping the algorithms when the gap between the training
loss and the optimal value $\min_w f(w)$ is less than $10^{-4}$.
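The speedup definition is a plain ratio of running times, as the following minimal sketch shows. The one-thread time is not reported directly in the paper; the value used in the usage note is back-computed from the first consistent-reading entry of Table 2 (77.15 s at a speedup of 1.94), so it is an assumption of this example rather than a reported measurement.

```python
def speedup(time_one_thread, time_p_threads):
    """Ratio of one-thread CPU time to p-thread CPU time."""
    if time_p_threads <= 0:
        raise ValueError("running time must be positive")
    return time_one_thread / time_p_threads

# One-thread time implied by the 2-thread consistent reading entry of Table 2.
implied_one_thread = 77.15 * 1.94   # about 149.7 seconds (derived, not reported)
```

With this implied baseline, `speedup(implied_one_thread, 62.20)` is about 2.41, in line with the 2.4x reported for consistent reading with 4 threads.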

We set $M$ in Algorithm 1 to be $\frac{2n}{p}$, where $n$ is the number of training instances
and $p$ is the number of threads. When $p = 1$, the setting of $M$ is the same as that in
SVRG. According to our theorems, the step size should be small. However, we
can also get good performance with a relatively large step size in practice. For
Hogwild!, in each epoch, we run each thread for $\frac{n}{p}$ iterations. We use a constant step size
$\gamma$, and we set $\gamma \leftarrow 0.9\gamma$ after every epoch. These settings are the same as those in
the experiments in Hogwild! [8]. For each epoch, our algorithm visits the whole
dataset three times and Hogwild! visits the whole dataset only once. To make
a fair comparison of the convergence rate, we study the change of the objective value
versus the number of effective passes. One effective pass of the dataset means the
whole dataset is visited once.

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
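The accounting of effective passes can be sketched as follows. The sketch assumes that each inner update counts as one instance visit, that AsySVRG uses $M = 2n/p$ inner updates per thread (the value consistent with the statement that one epoch visits the dataset three times), and that Hogwild! runs $n/p$ iterations per thread with the step size multiplied by 0.9 after every epoch. Function names and the example sizes are hypothetical.

```python
def asysvrg_passes_per_epoch(n, num_threads, M):
    """One pass for the full gradient plus the inner updates of all threads."""
    return 1.0 + num_threads * M / n

def hogwild_passes_per_epoch(n, num_threads, iters_per_thread):
    """Hogwild! only performs stochastic updates, with no full-gradient pass."""
    return num_threads * iters_per_thread / n

def hogwild_step_sizes(gamma0, num_epochs, decay=0.9):
    """Step size used in each epoch when gamma is multiplied by decay per epoch."""
    return [gamma0 * decay ** epoch for epoch in range(num_epochs)]
```

For $n = 20000$ and 10 threads, `asysvrg_passes_per_epoch(20000, 10, 4000)` gives 3 and `hogwild_passes_per_epoch(20000, 10, 2000)` gives 1, so three Hogwild! epochs correspond to one AsySVRG epoch on the effective-pass axis.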

5.2 Results 

In practice, we find that our AsySVRG algorithm without any lock strategy, denoted by
AsySVRG-unlock, can achieve the best performance. Table 2 shows the running time
and speedup results of the consistent reading, inconsistent reading, and unlock schemes
for AsySVRG on dataset rcv1. Here, 77.15s denotes 77.15 seconds, and 1.94x means the
speedup is 1.94, i.e., it is 1.94 times faster than the sequential (one-thread) algorithm.


Table 2: Lock versus Unlock (running time in seconds / speedup)

threads   consistent reading   inconsistent reading   AsySVRG-unlock
2         77.15s/1.94x         77.20s/1.94x           137.55s/1.09x
4         62.20s/2.4x          51.06s/2.93x           58.07s/2.58x
8         63.05s/2.4x          53.93s/2.78x           30.49s/4.92x
10        64.76s/2.3x          56.29s/2.66x           26s/5.77x


We find that the consistent reading scheme has the worst performance when four or more
threads are used. Hence, in
the following experiments, we only report the results of the inconsistent reading scheme,
denoted by AsySVRG-lock, and AsySVRG-unlock.

Table 3 compares the time cost between AsySVRG and Hogwild! to achieve a gap
less than $10^{-4}$ with 10 threads. We can find that our AsySVRG is much faster than
Hogwild!, either with lock or without lock.


Table 3: Time (in seconds) taken by 10 threads to reach a gap less than $10^{-4}$

dataset     AsySVRG-lock   AsySVRG-unlock   Hogwild!-lock   Hogwild!-unlock
rcv1        55.77          25.33            >500            >200
real-sim    42.20          21.16            >400            >200
news20      909.93         514.50           >4000           >2000


Figure 1 shows the speedup and convergence rate on three datasets. Here,
AsySVRG-lock-10 denotes AsySVRG with the lock strategy on 10 threads. Similar notations are
used for other settings of AsySVRG and Hogwild!. We can find that the speedup of
AsySVRG and Hogwild! is comparable. Combined with the results in Table 3, we can
find that Hogwild! is slower than AsySVRG with different numbers of threads. From
Figure 1, we can also find that the convergence rate of AsySVRG is much faster than
that of Hogwild!.

Figure 1: Experimental results. Left: speedup; right: convergence rate. Panels (a) and (b):
rcv1; panels (c) and (d): real-sim; panels (e) and (f): news20. Please note
that in (b), (d) and (f), some curves overlap.


6 Conclusion 

In this paper, we have proposed a novel asynchronous parallel SGD method, called 
AsySVRG, for multicore systems. Both theoretical and empirical results show that 
AsySVRG can outperform other state-of-the-art methods.

A Notations for Proof

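For the proof of Theorem 1:

Proof. According to (5), we obtain that

$$\begin{aligned}
E\|u_{m+1} - w_*\|^2 &= E\|u_m - w_*\|^2 - 2\eta E v_m^T (u_m - w_*) + \eta^2 E\|v_m\|^2 \\
&= E\|u_m - w_*\|^2 - 2\eta E \nabla f(u_{k(m)})^T (u_m - w_*) + \eta^2 E\|v_m\|^2.
\end{aligned} \tag{14}$$

For the old gradient, we have

$$\nabla f(u_{k(m)})^T (u_m - w_*) = \nabla f(u_{k(m)})^T (u_{k(m)} - w_*) + \nabla f(u_{k(m)})^T (u_m - u_{k(m)}). \tag{15}$$

Since $f(w)$ is convex and $L$-smooth, we have

$$\begin{aligned}
\nabla f(u_{k(m)})^T (u_{k(m)} - w_*) &\ge f(u_{k(m)}) - f(w_*), \\
\nabla f(u_{k(m)})^T (u_m - u_{k(m)}) &\ge f(u_m) - f(u_{k(m)}) - \frac{L}{2} \|u_m - u_{k(m)}\|^2.
\end{aligned}$$

Substituting the above inequalities into (14), we obtain

$$E\|u_{m+1} - w_*\|^2 + 2\eta E(f(u_m) - f(w_*)) \le E\|u_m - w_*\|^2 + \eta^2 E\|v_m\|^2 + \eta L E\|u_m - u_{k(m)}\|^2. \tag{16}$$

Since

$$\|u_m - u_{k(m)}\|^2 \le 2 \sum_{l=k(m)}^{m-1} \|u_{l+1} - u_l\|^2 = 2\eta^2 \sum_{l=k(m)}^{m-1} \|v_l\|^2,$$

taking expectation and using Lemma 1, we obtain

$$\begin{aligned}
E\|u_{m+1} - w_*\|^2 + 2\eta E(f(u_m) - f(w_*))
&\le E\|u_m - w_*\|^2 + \eta^2 E\|v_m\|^2 + 2\eta^3 L \sum_{l=k(m)}^{m-1} E\|v_l\|^2 \\
&\le E\|u_m - w_*\|^2 + \eta^2 \sum_{l=k(m)}^{m} E\|v_l\|^2 \\
&= E\|u_m - w_*\|^2 + \eta^2 \sum_{l=k(m)}^{m} E q_{k(l)} \\
&\le E\|u_m - w_*\|^2 + \rho^{2\tau} \eta^2 \sum_{l=k(m)}^{m} E q_m \\
&\le E\|u_m - w_*\|^2 + (\tau + 1) \rho^{2\tau} \eta^2 E q_m \\
&\le E\|u_m - w_*\|^2 + 4(\tau + 1) \rho^{2\tau} \eta^2 L E(f(u_m) - f(w_*) + f(u_0) - f(w_*)).
\end{aligned}$$

The second inequality uses $2L\eta \le 1$. In the first equality, we use the fact that $E\|v_l\|^2 =
E(E_{i_l}\|v_l\|^2) = E q_{k(l)}$. The last inequality uses the inequality

$$q_m \le 4L(f(u_m) - f(w_*) + f(u_0) - f(w_*)),$$

which has been proved in the original analysis of SVRG. Then summing up from $m = 0$ to $\tilde{M} - 1$, and taking
$w_{t+1} = \frac{1}{\tilde{M}} \sum_{m=0}^{\tilde{M}-1} u_m$ or randomly choosing a $u_m$ to be $w_{t+1}$, we can get

$$\begin{aligned}
2\tilde{M}\eta \left(1 - 2(\tau + 1)\rho^{2\tau} \eta L\right) E(f(w_{t+1}) - f(w_*))
&\le E\|w_t - w_*\|^2 + 4\tilde{M}(\tau + 1)\rho^{2\tau} \eta^2 L E(f(w_t) - f(w_*)) \\
&\le \left( \frac{2}{\mu} + 4\tilde{M}(\tau + 1)\rho^{2\tau} \eta^2 L \right) E(f(w_t) - f(w_*)).
\end{aligned} \tag{17}$$

Then, we have

$$E(f(w_{t+1}) - f(w_*)) \le \left( \frac{1}{\mu \tilde{M} \eta \left(1 - 2(\tau + 1)\rho^{2\tau} \eta L\right)} + \frac{2(\tau + 1)\rho^{2\tau} \eta L}{1 - 2(\tau + 1)\rho^{2\tau} \eta L} \right) E(f(w_t) - f(w_*)).$$

Of course, we need $1 - 2(\tau + 1)\rho^{2\tau} \eta L > 0$. □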

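For the proof of Theorem 2:

Proof. According to (9), we have

$$\begin{aligned}
E\|u_{m+1} - w_*\|^2 &= E\|u_m - w_*\|^2 - 2\eta E v_m^T (u_m - w_*) + \eta^2 E\|v_m\|^2 \\
&= E\|u_m - w_*\|^2 - 2\eta E \nabla f(\hat{u}_m)^T (u_m - w_*) + \eta^2 E\|v_m\|^2.
\end{aligned} \tag{18}$$

Similar to the analysis in Theorem 1, we can get

$$\begin{aligned}
E\|u_{m+1} - w_*\|^2 + 2\eta E(f(u_m) - f(w_*))
&\le E\|u_m - w_*\|^2 + \eta^2 E\|v_m\|^2 + \eta L E\|u_m - \hat{u}_m\|^2 \\
&\le E\|u_m - w_*\|^2 + \eta^2 E\|v_m\|^2 + 4L\eta^3 \sum_{l=a(m)}^{m-1} E\|v_l\|^2 \\
&\le E\|u_m - w_*\|^2 + \eta^2 E\|v_m\|^2 + 4\tau \rho^{\tau} L \eta^3 E\|v_m\|^2 \\
&\le E\|u_m - w_*\|^2 + \frac{\eta^2 + 4\tau \rho^{\tau} L \eta^3}{1 - \frac{1}{r} - 4 r \tau \rho^{\tau} \eta^2 L^2} E q(u_m) \\
&\le E\|u_m - w_*\|^2 + \frac{4L\eta^2 + 16\tau \rho^{\tau} L^2 \eta^3}{1 - \frac{1}{r} - 4 r \tau \rho^{\tau} \eta^2 L^2} E(f(u_m) - f(w_*) + f(u_0) - f(w_*)).
\end{aligned}$$

The second inequality is the same as the analysis in (13). The third inequality uses
Lemma 2. The fourth inequality uses Lemma 3.

For convenience, we use $c_2 = \dfrac{4L\eta^2 + 16\tau \rho^{\tau} L^2 \eta^3}{1 - \frac{1}{r} - 4 r \tau \rho^{\tau} \eta^2 L^2}$, sum the above inequality from
$m = 0$ to $\tilde{M} - 1$, and take $w_{t+1} = \frac{1}{\tilde{M}} \sum_{m=0}^{\tilde{M}-1} u_m$. Then, we obtain

$$(2\eta - c_2) \tilde{M} E(f(w_{t+1}) - f(w_*)) \le \left( \frac{2}{\mu} + \tilde{M} c_2 \right) E(f(w_t) - f(w_*)),$$

which means that

$$E(f(w_{t+1}) - f(w_*)) \le \left( \frac{2}{\mu \tilde{M} (2\eta - c_2)} + \frac{c_2}{2\eta - c_2} \right) E(f(w_t) - f(w_*)).$$

□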